using Dates
using MultivariateStats
using Plots
using NCDatasets
using StatsBase
using Unitful
using DataFrames
using Distributions
using StatsPlots
using Turing
using DynamicHMC

Project 01
Distributional Downscaling
1 Executive Summary
In this project, my objective was to model hourly precipitation at a single location given the total precipitation of that day. There are many ways to approach this problem; I opted for distributional methods. Initially, I created a generalized linear model that related the hourly 2-meter temperature to a distribution for hourly precipitation. Then I created a linear regression model that related the hourly 2-meter temperature, the hourly 500 hPa temperature, and the 500 hPa relative humidity to the distribution of hourly precipitation. The precipitation could then be estimated by sampling from the distribution. Any concerns about the correlational support between temperature and precipitation are valid, because model performance was laughably bad. Quantile-quantile plots between training data and model output showed no clear trends, meaning that my models do not serve well as forecasting tools. However, my models did perform better than the null model, which assumes that precipitation can be estimated without any known environmental conditions.
2 Data Import
3 Data Exploration
3.1 Distribution of Hourly Temperature and Precipitation (all ERA5)
3.2 Exploring Distribution Fitting
The distribution of rainfall is not Gaussian in any way. We’ve discussed in class that there is a series of distributions we should try fitting: first a log-normal distribution, then a log-Pearson III distribution, and an extreme value distribution as a last resort. I first fitted the data to a log-normal distribution and it looked pretty good (@lognormal). Unfortunately, I couldn’t fit a Pareto or a Weibull distribution without getting errors or bogus distributions, so I will stick with a log-normal distribution.
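As a sketch of the fitting step (the variable name `hourly_precip` is a stand-in, not from the original code), Distributions.jl can fit a log-normal by maximum likelihood directly:

```julia
using Distributions

# Stand-in data: positive hourly precipitation values in mm.
# LogNormal has support on (0, ∞), so zero-precipitation hours must be excluded.
hourly_precip = rand(LogNormal(-2.5, 2.6), 10_000)

# Maximum-likelihood log-normal fit
dist = fit_mle(LogNormal, hourly_precip)
println(dist)
```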
I also checked the distribution of precipitation given certain ranges of temperature. I wanted to see if the distribution parameters changed at all, and they did change through the temperature brackets (@lognormal-cuts .b).
i) LogNormal{Float64}(μ=-2.5389778370028733, σ=2.6045064318088427)
ii) LogNormal{Float64}(μ=-3.0652755422578033, σ=2.5621262450568993)
iii) LogNormal{Float64}(μ=-2.925626680304606, σ=2.2865827071057523)
iv) LogNormal{Float64}(μ=-3.748743357580327, σ=1.8380935211035427)
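The bracketed fits above can be reproduced with a sketch like the following (the temperature bin edges and variable names are illustrative, not the ones used in the project):

```julia
using Distributions

# Illustrative synthetic data: hourly 2-m temperature (K) and precipitation (mm)
temp   = rand(Normal(290, 10), 10_000)
precip = rand(LogNormal(-3.0, 2.3), 10_000)

# Fit a log-normal to the positive precipitation within each temperature bracket
edges = 260:15:320
for i in 1:length(edges)-1
    mask = (edges[i] .<= temp .< edges[i+1]) .& (precip .> 0)
    count(mask) > 1 && println("$(edges[i])-$(edges[i+1]) K: ",
                               fit_mle(LogNormal, precip[mask]))
end
```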
4 Methods
I created three models for this project. The first was a null model that assumed a uniform distribution of hourly precipitation, such that for each hour of rainfall, any intensity would have the same probability density. The second and third models both assume that the probability density of a given precipitation intensity is described by a log-normal distribution with parameters \(\mu\) and \(\sigma\). For the second model, the distribution was fit based on Equation 1 and Equation 2, which were based on GLM methods. For the third model, the distribution was fit based on ?@eq-c and ?@eq-d, which were based on linear regression methods. I acknowledge that it may be a strain to claim that model II is a GLM. The coefficients were determined using MCMC with chain lengths of 100 iterations. The chain length was kept small to maintain reasonable computing times. Parameter estimation was done on 2019 data and results were tested on 2020 data. Since the model is stochastic in nature, I ran the models 100 times over the testing data and collected summary statistics of the model results.
4.1 Parameter estimation for model II
In model II, both \(\mu\) and \(\sigma\) were assumed to correlate with 2-meter temperature. The mean is estimated with
\[ \mu_i = \alpha+\beta T_{2,i}, \tag{1}\]
where priors for \(\alpha\) and \(\beta\) were both standard normal distributions, and \(T_{2,i}\) is the observed 2-meter temperature. The standard deviation is estimated with a homologous equation, that is,
\[ \sigma_i = | \gamma+\chi T_{2,i} |, \tag{2}\]
where priors for \(\gamma\) and \(\chi\) were both standard normal distributions. I have chosen to use the identity link function for Equation 1 and a non-canonical absolute-value function for Equation 2. The identity link function was chosen because \(\mu \in (-\infty,\infty)\), and the absolute value function was chosen because \(\sigma \in (0,\infty)\).
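A minimal Turing.jl sketch of Model II under these assumptions (the function and variable names are mine, not from the project code):

```julia
using Turing, Distributions

# Model II: both log-normal parameters are linear in 2-m temperature;
# the absolute value keeps σ on (0, ∞), per Equation 2.
@model function model_II(T2, y)
    α ~ Normal(0, 1)   # standard-normal priors, per the text
    β ~ Normal(0, 1)
    γ ~ Normal(0, 1)
    χ ~ Normal(0, 1)
    for i in eachindex(y)
        μ = α + β * T2[i]
        σ = abs(γ + χ * T2[i])
        y[i] ~ LogNormal(μ, σ)
    end
end

# chain = sample(model_II(T2_train, precip_train), NUTS(), 100)
```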
4.2 Parameter estimation for model III
In model III, I assumed that \(\mu\) correlated with temperature and relative humidity at the 500 hPa geopotential level (\(T_{500,i}\) and \(H_{i}\), respectively), as well as with 2-meter temperature. That is,
\[ \mu_i = \alpha + \beta_1 T_{500,i} + \beta_2 H_{i} + \beta_3 T_{2,i}, \] where all coefficient priors, as well as \(\sigma\), were standard normal distributions.
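Analogously, a sketch of Model III in Turing.jl (again with illustrative names; the truncation on \(\sigma\)'s prior is my addition, since a log-normal scale parameter cannot be negative):

```julia
using Turing, Distributions

# Model III: μ is linear in 500 hPa temperature, 500 hPa relative humidity,
# and 2-m temperature; σ is a single fitted scalar.
@model function model_III(T500, H, T2, y)
    α  ~ Normal(0, 1)
    β1 ~ Normal(0, 1)
    β2 ~ Normal(0, 1)
    β3 ~ Normal(0, 1)
    σ  ~ truncated(Normal(0, 1); lower = 0)
    for i in eachindex(y)
        μ = α + β1 * T500[i] + β2 * H[i] + β3 * T2[i]
        y[i] ~ LogNormal(μ, σ)
    end
end
```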
4.3 Using Observed Daily Total Precipitation as a Boundary Condition
If I sampled from a distribution for \(y_i(x_i \mid \theta)\) twenty-four times, the sum would not be guaranteed to stay within the observed daily precipitation for that time period. To prevent overestimating daily precipitation, I reject any sample of the distribution of \(y_i(x_i \mid \theta)\) that puts the total daily precipitation above the corresponding observation.
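The rejection step can be sketched as follows (the names and the retry cap are illustrative):

```julia
using Distributions

# Sample one day of hourly precipitation, rejecting draws that would push the
# running total past the observed daily total. `dists` holds the 24 fitted
# hourly distributions; hours where no draw fits within `max_tries` stay at 0.
function sample_day(dists, daily_total; max_tries = 1_000)
    hourly = zeros(length(dists))
    total  = 0.0
    for (i, d) in enumerate(dists)
        for _ in 1:max_tries
            y = rand(d)
            if total + y <= daily_total
                hourly[i] = y
                total += y
                break
            end
        end
    end
    return hourly
end
```

One consequence of this scheme is that hours later in the day face a tighter remaining budget, which may contribute to the underprediction discussed in the results.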
4.4 Testing
5 Results
5.1 MCMC Parameterization Results
Though sample sizes were small, the parameters appear to be converging and rhat values are around 1.0. I am moderately confident that the MCMC parameterization behaves well.
Summary statistics of the MCMC fit for model II (one column omitted):

| parameters | mean | std | mcse | ess_bulk | ess_tail | rhat |
|---|---|---|---|---|---|---|
| α | 1.4376 | 0.9580 | 0.3180 | 8.3109 | 31.7840 | 1.0864 |
| β | -0.0153 | 0.0032 | 0.0011 | 8.2433 | 38.2860 | 1.0835 |
| γ | -4.3276 | 0.8355 | 0.5904 | 2.1030 | 21.0246 | 1.5158 |
| χ | 0.0067 | 0.0029 | 0.0020 | 2.0740 | 21.0246 | 1.5245 |
Markov chain Monte Carlo results for fitting parameters for model III.
Summary statistics of the MCMC fit for model III (one column omitted):

| parameters | mean | std | mcse | ess_bulk | ess_tail | rhat |
|---|---|---|---|---|---|---|
| α | -0.9685 | 0.9921 | 0.1363 | 54.9279 | 31.6824 | 1.0006 |
| β1 | -0.0348 | 0.0067 | 0.0010 | 45.6807 | 48.7252 | 1.0230 |
| β2 | 0.0239 | 0.0016 | 0.0002 | 82.1368 | 54.8565 | 1.0213 |
| β3 | 0.0265 | 0.0072 | 0.0011 | 43.5216 | 78.3393 | 0.9900 |
| σ | 2.2624 | 0.0411 | 0.0062 | 43.5172 | 34.2306 | 1.0202 |
5.2 Model Comparison
All three models that I made were terrible at reconstructing real observations. The scatter plots show that predictions and observations are quite inconsistent with each other (Figure 8, Figure 9, Figure 10). Furthermore, the high log-likelihood of low precipitation predictions can be directly attributed to the distribution function used. Ideally, high-likelihood values would spread further into higher precipitation predictions. Perhaps this could be done with Bayesian techniques that integrate daily total precipitation into the PDF directly, instead of the sampling trick that I used.
Despite the overwhelming sense of poor performance from the qualitative side, Models II and III do have lower mean squared errors when compared to the null model (Figure 8, Figure 9, Figure 10). This suggests that, even though predictions are pretty much pure guesswork, using prior knowledge about the climatology does produce better guesses. However, there does not seem to be much difference due to the implementation of the model or the information provided to the models.
I have also collected other summary statistics that provide insight into model performance, such as the total modeled precipitation and the model-observation correlation. The total modeled precipitation is best reconstructed by Model I, which is not surprising since its sampling distribution favors no particular precipitation intensity. The total modeled precipitation is significantly less than the actual annual total precipitation for models II and III. This is also expected, as the log-normal distribution each point is sampled from favors low-intensity precipitation values much more than high-intensity precipitation.
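The reported metrics can be computed with a short sketch like this (variable names illustrative):

```julia
using Statistics

# obs and pred: vectors of observed and modeled hourly precipitation (same units)
evaluate(obs, pred) = (
    total_modeled_precip  = sum(pred),
    model_obs_correlation = cor(obs, pred),
    mse                   = mean((obs .- pred) .^ 2),
)
```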
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| 1 | total_precip | 1147.01 mm | 1144.43 mm | 1144.43 mm | 1404.99 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 2 | total_modeled_precip | 1146.34 mm | 1143.49 mm | 1143.77 mm | 1404.27 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 3 | model_likelihood | 707.32 | 365.861 | 712.44 | 1011.63 | 0 | Float64 |
| 4 | model_obs_correlation | 0.0960184 | 0.0668904 | 0.0946014 | 0.177868 | 0 | Float64 |
| 5 | mse | 1.45541 mm² | 1.18064 mm² | 1.44745 mm² | 2.07862 mm² | 0 | Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}} |
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| 1 | total_precip | 1147.01 mm | 1144.43 mm | 1144.43 mm | 1404.99 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 2 | total_modeled_precip | 678.167 mm | 605.361 mm | 677.79 mm | 750.639 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 3 | model_likelihood | 9329.05 | 8869.35 | 9370.57 | 9937.11 | 0 | Float64 |
| 4 | model_obs_correlation | 0.124325 | 0.068071 | 0.12131 | 0.258016 | 0 | Float64 |
| 5 | mse | 0.53666 mm² | 0.421163 mm² | 0.528945 mm² | 0.703593 mm² | 0 | Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}} |
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| 1 | total_precip | 1147.01 mm | 1144.43 mm | 1144.43 mm | 1404.99 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 2 | total_modeled_precip | 747.919 mm | 651.62 mm | 749.794 mm | 880.444 mm | 0 | Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}} |
| 3 | model_likelihood | 9174.37 | 8680.74 | 9182.31 | 9683.36 | 0 | Float64 |
| 4 | model_obs_correlation | 0.180826 | 0.110423 | 0.173946 | 0.375689 | 0 | Float64 |
| 5 | mse | 0.549778 mm² | 0.434284 mm² | 0.539754 mm² | 0.785719 mm² | 0 | Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}} |
6 Conclusion
In this project I have explored the use of distributional downscaling for precipitation time series. I have also naively implemented an algorithm to digest available observations in an attempt to ground my distributional model. This method has several limitations; some are characteristic of the method, while others are specific to my implementation. In general, distributional downscaling is not suitable for making climate forecasts. Instead, it is good for providing a statistical assessment for a given set of predictors. Unique to my model is the tendency to underpredict precipitation, which seems similar to the drizzle bias that we’ve learned about in class. In order to improve upon my work, I would need to find a better method for assimilating existing observations during the downscaling process, perhaps with Bayesian methods that update the probability of a given rainfall intensity using observations and continuity rules.